Machine Learning Analysis Report

Generated on August 03, 2025 at 08:44 PM

Machine Learning Analysis Pipeline

EDR: Dataset Loading & Preprocessing

EDR – Train/Test Overview
• Train shape: (88089, 20) | Test shape: (7533, 20)
• Total train samples: 88,089 | Total test samples: 7,533
• Number of features: 16
• Target column: 'label'
• Missing values (train): 0 | (test): 0
EDR – Train Class Distribution
• 0: 87,232
• 1: 857
• Class balance (minority/majority): 0.9824%
EDR – Feature Preparation
• Target encoding: {0: 0, 1: 1}
• Data preprocessing: Infinite values handled, missing values filled with train medians
• Feature scaling: StandardScaler (fit on train, applied to test)
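The preparation steps above (infinite values handled, missing values filled with train medians, StandardScaler fit on train only) can be sketched as follows; the function and column names are illustrative, not the pipeline's actual identifiers:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def prepare_features(train_df, test_df, target_col="label"):
    """Sketch of the report's preprocessing: inf -> NaN, NaN -> train median,
    then standardize using statistics fit on the training split only."""
    X_train = train_df.drop(columns=[target_col]).replace([np.inf, -np.inf], np.nan)
    X_test = test_df.drop(columns=[target_col]).replace([np.inf, -np.inf], np.nan)

    medians = X_train.median()          # train-only statistics, no test leakage
    X_train = X_train.fillna(medians)
    X_test = X_test.fillna(medians)     # test NaNs also filled with *train* medians

    scaler = StandardScaler().fit(X_train)
    return scaler.transform(X_train), scaler.transform(X_test), scaler
```

Fitting the medians and the scaler on the training split alone is what keeps test-set information out of the model.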
⚠️ Extreme Class Imbalance Detected
• The minority class accounts for under 1% of the training data (minority/majority ratio: 0.9824%)
• Such extreme imbalance can push models to predict the majority class for every sample
• Consider more aggressive SMOTE ratios, cost-sensitive learning, or ensemble methods
• Metrics like Precision-Recall AUC and F1 are more meaningful than accuracy
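One of the mitigations listed above, cost-sensitive learning, can be sketched with scikit-learn's `class_weight` option. The data here is synthetic (the report's features are not shown), loosely mirroring the ~99:1 class ratio:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (rng.random(2000) < 0.01).astype(int)  # ~1% minority, as in this dataset
X[y == 1] += 1.5                           # shift minority class so it is learnable

# class_weight="balanced" reweights errors by inverse class frequency,
# so each minority mistake costs roughly 99x a majority mistake.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
recall = (clf.predict(X)[y == 1] == 1).mean()
```

Without the reweighting, a model on this ratio can reach ~99% accuracy while catching almost no minority samples.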
Baseline (Most-Frequent) Accuracy: 0.9902
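The most-frequent baseline above is what scikit-learn's DummyClassifier reproduces; with 7,459 of 7,533 test samples in class 0, always predicting 0 already scores 0.9902:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

y_train = np.array([0] * 87232 + [1] * 857)   # train class counts from this report
y_test = np.array([0] * 7459 + [1] * 74)      # test class counts from this report
X_train = np.zeros((len(y_train), 1))         # features are irrelevant to this baseline
X_test = np.zeros((len(y_test), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
accuracy = baseline.score(X_test, y_test)     # 7459 / 7533
```

Any model below this accuracy is doing worse than ignoring the features entirely, which is why the imbalance-aware metrics matter more here.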

EDR: Model Performance Comparison

EDR – Model Performance Metrics

| Model                 | Accuracy | Balanced Acc | Precision | Recall | F1     | ROC-AUC | PR-AUC |
|-----------------------|----------|--------------|-----------|--------|--------|---------|--------|
| Logistic Regression   | 0.9361   | 0.6400       | 0.0547    | 0.3378 | 0.0942 | 0.6447  | 0.0498 |
| Random Forest (SMOTE) | 0.8703   | 0.6067       | 0.0262    | 0.3378 | 0.0487 | 0.8074  | 0.0548 |
| LightGBM              | 0.8418   | 0.6726       | 0.0310    | 0.5000 | 0.0585 | 0.8183  | 0.0980 |
| Balanced RF           | 0.8824   | 0.6864       | 0.0407    | 0.4865 | 0.0752 | 0.8577  | 0.0906 |
| SGD SVM               | 0.9480   | 0.5857       | 0.0457    | 0.2162 | 0.0755 | n/a     | n/a    |
| IsolationForest       | 0.9821   | 0.5494       | 0.1039    | 0.1081 | 0.1060 | n/a     | n/a    |
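The missing ROC-AUC and PR-AUC values for SGD SVM and IsolationForest reflect that these are ranking metrics computed from continuous scores rather than hard labels. A minimal sketch of how one row of this table is computed, using a tiny hypothetical label/score pair:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             precision_score, recall_score, f1_score,
                             roc_auc_score, average_precision_score)

# Hypothetical example: true labels and model scores in [0, 1].
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.1, 0.3, 0.4, 0.2, 0.6, 0.1, 0.7, 0.35])
y_pred = (scores >= 0.5).astype(int)               # default 0.5 threshold

row = {
    "Accuracy": accuracy_score(y_true, y_pred),
    "Balanced Acc": balanced_accuracy_score(y_true, y_pred),
    "Precision": precision_score(y_true, y_pred),
    "Recall": recall_score(y_true, y_pred),
    "F1": f1_score(y_true, y_pred),
    "ROC-AUC": roc_auc_score(y_true, scores),           # needs scores, not labels
    "PR-AUC": average_precision_score(y_true, scores),  # needs scores, not labels
}
```

The first five columns depend on the chosen threshold; the two AUC columns summarize the score ranking across all thresholds.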

Confusion Matrix Analysis

| Model                 | TN   | FP   | FN | TP | FP Rate | Miss Rate |
|-----------------------|------|------|----|----|---------|-----------|
| Logistic Regression   | 7027 | 432  | 49 | 25 | 5.79%   | 66.22%    |
| Random Forest (SMOTE) | 6531 | 928  | 49 | 25 | 12.44%  | 66.22%    |
| LightGBM              | 6304 | 1155 | 37 | 37 | 15.48%  | 50.00%    |
| Balanced RF           | 6611 | 848  | 38 | 36 | 11.37%  | 51.35%    |
| SGD SVM               | 7125 | 334  | 58 | 16 | 4.48%   | 78.38%    |
| IsolationForest       | 7390 | 69   | 66 | 8  | 0.93%   | 89.19%    |
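FP Rate and Miss Rate in this table follow directly from the confusion-matrix cells: FP Rate = FP / (FP + TN) and Miss Rate = FN / (FN + TP). Using the Logistic Regression row as a worked check:

```python
tn, fp, fn, tp = 7027, 432, 49, 25   # Logistic Regression row above

fp_rate = fp / (fp + tn)     # fraction of benign samples flagged: 432 / 7459
miss_rate = fn / (fn + tp)   # fraction of true positives missed: 49 / 74

print(f"FP Rate: {fp_rate:.2%}, Miss Rate: {miss_rate:.2%}")
```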

Best Models by Metric

• Accuracy: IsolationForest (0.9821)
• Balanced Acc: Balanced RF (0.6864)
• Precision: IsolationForest (0.1039)
• Recall: LightGBM (0.5000)
• F1: IsolationForest (0.1060)
• ROC-AUC: Balanced RF (0.8577)
• PR-AUC: LightGBM (0.0980)
• Lowest False Positive Rate: IsolationForest (0.93%)
• Lowest Miss Rate: LightGBM (50.00%)

EDR – Metrics by Model

EDR – ROC Curves

EDR – Precision–Recall Curves

EDR – Predicted Probability Distributions

EDR – Threshold Sweep
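A threshold sweep like the one plotted above trades FP rate against miss rate by moving the decision cutoff away from the default 0.5. A minimal sketch with scikit-learn's precision_recall_curve, using hypothetical scores and one common selection rule (maximum F1):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
scores = np.array([0.05, 0.1, 0.2, 0.3, 0.4, 0.6, 0.55, 0.9])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# Pick the threshold maximizing F1 over the sweep.
f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
best = np.argmax(f1[:-1])          # final (precision, recall) point has no threshold
best_threshold = thresholds[best]
```

On the heavily imbalanced data in this report, the F1-optimal cutoff usually sits well away from 0.5, which is the point of running the sweep at all.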

EDR: Logistic Regression – Detailed Analysis

EDR – Logistic Regression: Confusion Matrix

EDR – Logistic Regression: Classification Report

| Class    | precision | recall | f1     | support |
|----------|-----------|--------|--------|---------|
| 0        | 0.9931    | 0.9421 | 0.9669 | 7459    |
| 1        | 0.0547    | 0.3378 | 0.0942 | 74      |
| accuracy |           |        | 0.9361 | 7533    |

EDR – Logistic Regression: Feature Importance

EDR: Random Forest (SMOTE) – Detailed Analysis

EDR – Random Forest (SMOTE): Confusion Matrix

EDR – Random Forest (SMOTE): Classification Report

| Class    | precision | recall | f1     | support |
|----------|-----------|--------|--------|---------|
| 0        | 0.9926    | 0.8756 | 0.9304 | 7459    |
| 1        | 0.0262    | 0.3378 | 0.0487 | 74      |
| accuracy |           |        | 0.8703 | 7533    |
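SMOTE itself lives in the imbalanced-learn package; as a dependency-free illustration of the rebalancing idea behind this model, plain random oversampling of the minority class before fitting the forest looks like this (synthetic data, not the report's features):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def random_oversample(X, y, rng):
    """Duplicate minority rows until classes are balanced.
    (SMOTE instead synthesizes new points between minority neighbours.)"""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (rng.random(500) < 0.05).astype(int)   # imbalanced toy data
X[y == 1] += 2.0                           # make the minority class learnable

X_bal, y_bal = random_oversample(X, y, rng)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_bal, y_bal)
```

Note that resampling is applied to the training split only; the test split stays at its natural class ratio, as in the tables above.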

EDR – Random Forest (SMOTE): Feature Importance

EDR: LightGBM – Detailed Analysis

EDR – LightGBM: Confusion Matrix

EDR – LightGBM: Classification Report

| Class    | precision | recall | f1     | support |
|----------|-----------|--------|--------|---------|
| 0        | 0.9942    | 0.8452 | 0.9136 | 7459    |
| 1        | 0.0310    | 0.5000 | 0.0585 | 74      |
| accuracy |           |        | 0.8418 | 7533    |

EDR – LightGBM: Feature Importance

EDR: Balanced RF – Detailed Analysis

EDR – Balanced RF: Confusion Matrix

EDR – Balanced RF: Classification Report

| Class    | precision | recall | f1     | support |
|----------|-----------|--------|--------|---------|
| 0        | 0.9943    | 0.8863 | 0.9372 | 7459    |
| 1        | 0.0407    | 0.4865 | 0.0752 | 74      |
| accuracy |           |        | 0.8824 | 7533    |

EDR – Balanced RF: Feature Importance

EDR: SGD SVM – Detailed Analysis

EDR – SGD SVM: Confusion Matrix

EDR – SGD SVM: Classification Report

| Class    | precision | recall | f1     | support |
|----------|-----------|--------|--------|---------|
| 0        | 0.9919    | 0.9552 | 0.9732 | 7459    |
| 1        | 0.0457    | 0.2162 | 0.0755 | 74      |
| accuracy |           |        | 0.9480 | 7533    |

EDR – SGD SVM: Feature Importance

EDR: IsolationForest – Detailed Analysis

EDR – IsolationForest: Confusion Matrix

EDR – IsolationForest: Classification Report

| Class    | precision | recall | f1     | support |
|----------|-----------|--------|--------|---------|
| 0        | 0.9911    | 0.9907 | 0.9909 | 7459    |
| 1        | 0.1039    | 0.1081 | 0.1060 | 74      |
| accuracy |           |        | 0.9821 | 7533    |

EDR – IsolationForest: Feature Importance

Feature importance not available for this model type.
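IsolationForest exposes no native feature importances, but if a comparable view is wanted, scikit-learn's model-agnostic permutation importance can be computed against a custom scoring callable. A sketch under the assumption that labels are available for the held-out split (synthetic data, with the anomaly signal planted in feature 0):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.inspection import permutation_importance
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
X[:10, 0] += 6.0                 # anomalies differ only in feature 0
y = np.zeros(300, dtype=int)
y[:10] = 1

iso = IsolationForest(random_state=0).fit(X)

def anomaly_auc(estimator, X, y):
    # Higher decision_function means "more normal", so negate for anomaly scoring.
    return roc_auc_score(y, -estimator.decision_function(X))

result = permutation_importance(iso, X, y, scoring=anomaly_auc,
                                n_repeats=5, random_state=0)
```

Shuffling feature 0 destroys the ranking quality, so its mean importance dominates; the uninformative features score near zero.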

XDR: Dataset Loading & Preprocessing

XDR – Train/Test Overview
• Train shape: (88089, 34) | Test shape: (7533, 34)
• Total train samples: 88,089 | Total test samples: 7,533
• Number of features: 30
• Target column: 'label'
• Missing values (train): 0 | (test): 0
XDR – Train Class Distribution
• 0: 87,232
• 1: 857
• Class balance (minority/majority): 0.9824%
XDR – Feature Preparation
• Target encoding: {0: 0, 1: 1}
• Data preprocessing: Infinite values handled, missing values filled with train medians
• Feature scaling: StandardScaler (fit on train, applied to test)
⚠️ Extreme Class Imbalance Detected
• The minority class accounts for under 1% of the training data (minority/majority ratio: 0.9824%)
• Such extreme imbalance can push models to predict the majority class for every sample
• Consider more aggressive SMOTE ratios, cost-sensitive learning, or ensemble methods
• Metrics like Precision-Recall AUC and F1 are more meaningful than accuracy
Baseline (Most-Frequent) Accuracy: 0.9902

XDR: Model Performance Comparison

XDR – Model Performance Metrics

| Model                 | Accuracy | Balanced Acc | Precision | Recall | F1     | ROC-AUC | PR-AUC |
|-----------------------|----------|--------------|-----------|--------|--------|---------|--------|
| Logistic Regression   | 0.9356   | 0.5996       | 0.0423    | 0.2568 | 0.0727 | 0.6560  | 0.0462 |
| Random Forest (SMOTE) | 0.9034   | 0.5900       | 0.0288    | 0.2703 | 0.0521 | 0.7988  | 0.0670 |
| LightGBM              | 0.8756   | 0.6897       | 0.0395    | 0.5000 | 0.0732 | 0.8509  | 0.1250 |
| Balanced RF           | 0.8937   | 0.6921       | 0.0451    | 0.4865 | 0.0825 | 0.8588  | 0.0929 |
| SGD SVM               | 0.8145   | 0.5786       | 0.0182    | 0.3378 | 0.0346 | n/a     | n/a    |
| IsolationForest       | 0.9881   | 0.5190       | 0.1364    | 0.0405 | 0.0625 | n/a     | n/a    |
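The missing ROC-AUC and PR-AUC entries for SGD SVM and IsolationForest likely reflect the lack of predict_proba on those estimators; both still expose decision_function, whose continuous output is enough for ranking metrics. A sketch of the idea with IsolationForest on synthetic data:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
X[:20] += 4.0                        # planted anomalies
y = np.r_[np.ones(20, dtype=int), np.zeros(380, dtype=int)]

iso = IsolationForest(random_state=0).fit(X)
scores = -iso.decision_function(X)   # negate: higher = more anomalous

roc = roc_auc_score(y, scores)
pr = average_precision_score(y, scores)
```

Filling the table this way would make the AUC columns comparable across all six models instead of leaving two rows blank.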

Confusion Matrix Analysis

| Model                 | TN   | FP   | FN | TP | FP Rate | Miss Rate |
|-----------------------|------|------|----|----|---------|-----------|
| Logistic Regression   | 7029 | 430  | 55 | 19 | 5.76%   | 74.32%    |
| Random Forest (SMOTE) | 6785 | 674  | 54 | 20 | 9.04%   | 72.97%    |
| LightGBM              | 6559 | 900  | 37 | 37 | 12.07%  | 50.00%    |
| Balanced RF           | 6696 | 763  | 38 | 36 | 10.23%  | 51.35%    |
| SGD SVM               | 6111 | 1348 | 49 | 25 | 18.07%  | 66.22%    |
| IsolationForest       | 7440 | 19   | 71 | 3  | 0.25%   | 95.95%    |

Best Models by Metric

• Accuracy: IsolationForest (0.9881)
• Balanced Acc: Balanced RF (0.6921)
• Precision: IsolationForest (0.1364)
• Recall: LightGBM (0.5000)
• F1: Balanced RF (0.0825)
• ROC-AUC: Balanced RF (0.8588)
• PR-AUC: LightGBM (0.1250)
• Lowest False Positive Rate: IsolationForest (0.25%)
• Lowest Miss Rate: LightGBM (50.00%)

XDR – Metrics by Model

XDR – ROC Curves

XDR – Precision–Recall Curves

XDR – Predicted Probability Distributions

XDR – Threshold Sweep

XDR: Logistic Regression – Detailed Analysis

XDR – Logistic Regression: Confusion Matrix

XDR – Logistic Regression: Classification Report

| Class    | precision | recall | f1     | support |
|----------|-----------|--------|--------|---------|
| 0        | 0.9922    | 0.9424 | 0.9667 | 7459    |
| 1        | 0.0423    | 0.2568 | 0.0727 | 74      |
| accuracy |           |        | 0.9356 | 7533    |

XDR – Logistic Regression: Feature Importance

XDR: Random Forest (SMOTE) – Detailed Analysis

XDR – Random Forest (SMOTE): Confusion Matrix

XDR – Random Forest (SMOTE): Classification Report

| Class    | precision | recall | f1     | support |
|----------|-----------|--------|--------|---------|
| 0        | 0.9921    | 0.9096 | 0.9491 | 7459    |
| 1        | 0.0288    | 0.2703 | 0.0521 | 74      |
| accuracy |           |        | 0.9034 | 7533    |

XDR – Random Forest (SMOTE): Feature Importance

XDR: LightGBM – Detailed Analysis

XDR – LightGBM: Confusion Matrix

XDR – LightGBM: Classification Report

| Class    | precision | recall | f1     | support |
|----------|-----------|--------|--------|---------|
| 0        | 0.9944    | 0.8793 | 0.9333 | 7459    |
| 1        | 0.0395    | 0.5000 | 0.0732 | 74      |
| accuracy |           |        | 0.8756 | 7533    |

XDR – LightGBM: Feature Importance

XDR: Balanced RF – Detailed Analysis

XDR – Balanced RF: Confusion Matrix

XDR – Balanced RF: Classification Report

| Class    | precision | recall | f1     | support |
|----------|-----------|--------|--------|---------|
| 0        | 0.9944    | 0.8977 | 0.9436 | 7459    |
| 1        | 0.0451    | 0.4865 | 0.0825 | 74      |
| accuracy |           |        | 0.8937 | 7533    |

XDR – Balanced RF: Feature Importance

XDR: SGD SVM – Detailed Analysis

XDR – SGD SVM: Confusion Matrix

XDR – SGD SVM: Classification Report

| Class    | precision | recall | f1     | support |
|----------|-----------|--------|--------|---------|
| 0        | 0.9920    | 0.8193 | 0.8974 | 7459    |
| 1        | 0.0182    | 0.3378 | 0.0346 | 74      |
| accuracy |           |        | 0.8145 | 7533    |

XDR – SGD SVM: Feature Importance

XDR: IsolationForest – Detailed Analysis

XDR – IsolationForest: Confusion Matrix

XDR – IsolationForest: Classification Report

| Class    | precision | recall | f1     | support |
|----------|-----------|--------|--------|---------|
| 0        | 0.9905    | 0.9975 | 0.9940 | 7459    |
| 1        | 0.1364    | 0.0405 | 0.0625 | 74      |
| accuracy |           |        | 0.9881 | 7533    |

XDR – IsolationForest: Feature Importance

Feature importance not available for this model type.